Not as Easy as It Seems: Automating the Construction of Lexical Chains Using Roget's Thesaurus

نویسندگان

  • Mario Jarmasz
  • Stan Szpakowicz
چکیده

Morris and Hirst [10] present a method of linking significant words that are about the same topic. The resulting lexical chains are a means of identifying cohesive regions in a text, with applications in many natural language processing tasks, including text summarization. The first lexical chains were constructed manually using Roget’s International Thesaurus. Morris and Hirst wrote that automation would be straightforward given an electronic thesaurus. All applications so far have used WordNet to produce lexical chains, perhaps because adequate electronic versions of Roget’s were not available until recently. We discuss the building of lexical chains using an electronic version of Roget’s Thesaurus. We implement a variant of the original algorithm, and explain the necessary design decisions. We include a comparison with other implementations.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Roget's Thesaurus as a Lexical Resource for Natural Language Processing

WordNet proved that it is possible to construct a large-scale electronic lexical database on the principles of lexical semantics. It has been accepted and used extensively by computational linguists ever since it was released. Some of its applications include information retrieval, language generation, question answering, text categorization, text classification and word sense disambiguation. I...

متن کامل

Text Segmentation Using Roget-Based Weighted Lexical Chains

In this article we present a new method for text segmentation. The method relies on the number of lexical chains (LCs) which end in a sentence, which begin in the following sentence and which traverse the two successive sentences. The lexical chains are based on Roget’s thesaurus (the 1987 and the 1911 version). We evaluate the method on ten texts from the DUC 2002 conference and on twenty text...

متن کامل

Using the Generic Document Profile to Cluster Similar Texts

The World Wide Web contains a huge quantity of text that is notoriously inefficient to use. This work aims to apply a text processing technique based on thesaurally derived lexical chains to improve Internet Information Retrieval where a lexical chain a set of words in a text that are related by both proximity, and by relations derived from an external lexical knowledge source such as WordNet, ...

متن کامل

Roget's Thesaurus: a Lexical Resource to Treasure

This paper presents the steps involved in creating an electronic lexical knowledge base from the 1987 Penguin edition of Roget’s Thesaurus. Semantic relations are labelled with the help of WordNet. The two resources are compared in a qualitative and quantitative manner. Differences in the organization of the lexical material are discussed, as well as the possibility of merging both resources.

متن کامل

Experiments in Topic Detection

Department of Mathematics and Computer Science University of Lethbridge 4401 University Drive Lethbridge, AB T1K 3M4, Canada E-mail: [email protected] Abstract Dividing documents into topically-coherent units and discovering their topic might have many uses. We present a system that proceeds in two steps: (1) the input text is segmented at places where there is a probable topic shift, (2) lexic...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003